{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2019 Google LLC\n",
    "# \n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/GoogleCloudPlatform/keras-idiomatic-programmer/blob/master/community-labs/Community Lab - Ensemble.ipynb\">\n",
    "<img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
    "\n",
    "For best performance using Colab, once the notebook is launched, from dropdown menu select **Runtime -> Change Runtime Type**, and select **GPU** for **Hardware Accelerator**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Composable \"Design Pattern\" for AutoML friendly models\n",
    "\n",
    "## Community Lab 3: Ensemble Training\n",
    "\n",
    "### Objective\n",
    "\n",
    "To replace a traditional \"inter-model\" ensemble of models of high complexity with an \"intra-model\" ensemble of lower complexity, while retaining the performance benefits.\n",
    "\n",
    "*Question*: Can one achieve the same performance with intra-model bagging vs. traditional inter-model ensemble?\n",
    "\n",
    "*Question*: Can one achieve the same performance with intra-model stacking vs. traditional inter-model ensemble?\n",
    "\n",
    "\n",
    "### Approach\n",
    "\n",
    "We will use the composable design pattern, and prebuilt units from the Google Cloud AI Developer Relations repo: Model Zoo\n",
    "\n",
    "We will use the composable design pattern, and prebuilt units from the Google Cloud AI Developer Relations repo: [Model Zoo](https://github.com/GoogleCloudPlatform/keras-idiomatic-programmer/tree/master/zoo)\n",
    "\n",
    "If you are not familiar with the Composable design pattern, we recommemd you review the [ResNet](https://github.com/GoogleCloudPlatform/keras-idiomatic-programmer/tree/master/zoo/resnet) model in our zoo.\n",
    "\n",
    "We recommend a constant set for hyperparameters, where batch_size is 32 and initial learning rate is 0.001 -- but you may use any value for hyperparameters you prefer.\n",
    "\n",
    "\n",
    "### Reporting Findings\n",
    "\n",
    "You can contact us on your findings via the twitter account: @andrewferlitsch\n",
    "\n",
    "### Dataset\n",
    "\n",
    "In this notebook, we use the CIFAR-10 datasets which consist of images 32x32x3 for 10 classes -- but you may use any dataset you prefer.\n",
    "\n",
    "### Steps\n",
    "\n",
    "1. Build and train a baseline (single instance) model for CIFAR-10.\n",
    "\n",
    "2. Build and train two more baseline model instances (three in total), each with a different draw for weight initialization.\n",
    "\n",
    "3. Construct an inter-model ensemble from the trained model instances and evaluate it.\n",
    "\n",
    "4. Observe the weight variances between the trained model instances.\n",
    "\n",
    "5. Evaluate an interchange of the top layer weights between the trained model instances and observe the performance.\n",
    "\n",
    "6. Build and train an intra-model bagging model ensemble.\n",
    "\n",
    "7. Build a wrapper model to weight parameterize the intra-model bagging ensemble (majority voting).\n",
    "\n",
    "8. Evaluate the intra-model bagging wrapper model.\n",
    "\n",
    "9. Build and train an intra-model stacking model ensemble.\n",
    "\n",
    "10. Evaluate the intra-model stacking model ensemble."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab\n",
    "\n",
    "### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow.keras import Input, Model\n",
    "from tensorflow.keras.layers import Conv2D, Flatten, Conv2DTranspose, ReLU, Add, Dense, Dropout, GaussianNoise\n",
    "from tensorflow.keras.layers import BatchNormalization, GlobalAveragePooling2D, Activation, Concatenate\n",
    "from tensorflow.keras.optimizers import Adam\n",
    "from tensorflow.keras.regularizers import l2\n",
    "from tensorflow.keras.callbacks import LearningRateScheduler\n",
    "from tensorflow.keras.datasets import cifar10\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get the Dataset\n",
    "\n",
    "Load the dataset into memory as numpy arrays, and then normalize the image data (preprocessing)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.datasets import cifar10\n",
    "(x_train, y_train), (x_test, y_test) = cifar10.load_data()\n",
    "x_train = (x_train / 255.0).astype(np.float32)\n",
    "x_test  = (x_test / 255.0).astype(np.float32)\n",
    "print(x_train.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build Baseline Model for CIFAR-10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from resnet/resnet_v2_c.py\n",
    "\n",
    "class ResNetV2(object):\n",
    "    \"\"\" Construct a Residual Convolution Network Network V2 \"\"\"\n",
    "    # Meta-parameter: list of groups: number of filters and number of blocks\n",
    "    groups = { 50 : [ { 'n_filters' : 64, 'n_blocks': 3 },\n",
    "                      { 'n_filters': 128, 'n_blocks': 4 },\n",
    "                      { 'n_filters': 256, 'n_blocks': 6 },\n",
    "                      { 'n_filters': 512, 'n_blocks': 3 } ],            # ResNet50\n",
    "               101: [ { 'n_filters' : 64, 'n_blocks': 3 },\n",
    "                      { 'n_filters': 128, 'n_blocks': 4 },\n",
    "                      { 'n_filters': 256, 'n_blocks': 23 },\n",
    "                      { 'n_filters': 512, 'n_blocks': 3 } ],            # ResNet101\n",
    "               152: [ { 'n_filters' : 64, 'n_blocks': 3 },\n",
    "                      { 'n_filters': 128, 'n_blocks': 8 },\n",
    "                      { 'n_filters': 256, 'n_blocks': 36 },\n",
    "                      { 'n_filters': 512, 'n_blocks': 3 } ]             # ResNet152\n",
    "             }\n",
    "    init_weights = 'he_normal'\n",
    "    reg=l2(0.001)\n",
    "    _model = None\n",
    "\n",
    "    def __init__(self, n_layers, input_shape=(224, 224, 3), n_classes=1000):\n",
    "        \"\"\" Construct a Residual Convolutional Neural Network V2\n",
    "            n_layers   : number of layers\n",
    "            input_shape: input shape\n",
    "            n_classes  : number of output classes\n",
    "        \"\"\"\n",
    "        # predefined\n",
    "        if isinstance(n_layers, int):\n",
    "            if n_layers not in [50, 101, 152]:\n",
    "                raise Exception(\"ResNet: Invalid value for n_layers\")\n",
    "            groups = self.groups[n_layers]\n",
    "        # user defined\n",
    "        else:\n",
    "            groups = n_layers\n",
    "\n",
    "        # The input tensor\n",
    "        inputs = Input(input_shape)\n",
    "\n",
    "        # The stem convolutional group\n",
    "        x = self.stem(inputs)\n",
    "\n",
    "        # The learner\n",
    "        x = self.learner(x, groups=groups)\n",
    "\n",
    "        # The classifier \n",
    "        outputs = self.classifier(x, n_classes)\n",
    "\n",
    "        # Instantiate the Model\n",
    "        self._model = Model(inputs, outputs)\n",
    "\n",
    "    @property\n",
    "    def model(self):\n",
    "        return self._model\n",
    "\n",
    "    @model.setter\n",
    "    def model(self, _model):\n",
    "        self._model = _model\n",
    "\n",
    "    def stem(self, inputs):\n",
    "        \"\"\" Construct the Stem Convolutional Group \n",
    "            inputs : the input vector\n",
    "        \"\"\"\n",
    "        # The 224x224 images are zero padded (black - no signal) to be 230x230 images prior to the first convolution\n",
    "        x = ZeroPadding2D(padding=(3, 3))(inputs)\n",
    "    \n",
    "        # First Convolutional layer uses large (coarse) filter\n",
    "        x = Conv2D(64, (7, 7), strides=(2, 2), padding='valid', use_bias=False, \n",
    "                   kernel_initializer=self.init_weights, kernel_regularizer=self.reg)(x)\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "    \n",
    "        # Pooled feature maps will be reduced by 75%\n",
    "        x = ZeroPadding2D(padding=(1, 1))(x)\n",
    "        x = MaxPooling2D((3, 3), strides=(2, 2))(x)\n",
    "        return x\n",
    "\n",
    "    def learner(self, x, **metaparameters):\n",
    "        \"\"\" Construct the Learner\n",
    "            x     : input to the learner\n",
    "            groups: list of groups: number of filters and blocks\n",
    "        \"\"\"\n",
    "        groups = metaparameters['groups']\n",
    "\n",
    "        # First Residual Block Group (not strided)\n",
    "        x = ResNetV2.group(x, strides=(1, 1), **groups.pop(0))\n",
    "\n",
    "        # Remaining Residual Block Groups (strided)\n",
    "        for group in groups:\n",
    "            x = ResNetV2.group(x, **group)\n",
    "        return x\n",
    "    \n",
    "    @staticmethod\n",
    "    def group(x, strides=(2, 2), init_weights=None, **metaparameters):\n",
    "        \"\"\" Construct a Residual Group\n",
    "            x         : input into the group\n",
    "            strides   : whether the projection block is a strided convolution\n",
    "            n_filters : number of filters for the group\n",
    "            n_blocks  : number of residual blocks with identity link\n",
    "        \"\"\"\n",
    "        n_blocks  = metaparameters['n_blocks']\n",
    "\n",
    "        # Double the size of filters to fit the first Residual Group\n",
    "        x = ResNetV2.projection_block(x, strides=strides, init_weights=init_weights, **metaparameters)\n",
    "\n",
    "        # Identity residual blocks\n",
    "        for _ in range(n_blocks):\n",
    "            x = ResNetV2.identity_block(x, init_weights=init_weights, **metaparameters)\n",
    "        return x\n",
    "\n",
    "    @staticmethod\n",
    "    def identity_block(x, init_weights=None, **metaparameters):\n",
    "        \"\"\" Construct a Bottleneck Residual Block with Identity Link\n",
    "            x        : input into the block\n",
    "            n_filters: number of filters\n",
    "            reg      : kernel regularizer\n",
    "        \"\"\"\n",
    "        n_filters = metaparameters['n_filters']\n",
    "        if 'reg' in metaparameters:\n",
    "            reg = metaparameters['reg']\n",
    "        else:\n",
    "            reg = ResNetV2.reg\n",
    "\n",
    "        if init_weights is None:\n",
    "            init_weights = ResNetV2.init_weights\n",
    "    \n",
    "        # Save input vector (feature maps) for the identity link\n",
    "        shortcut = x\n",
    "    \n",
    "        ## Construct the 1x1, 3x3, 1x1 convolution block\n",
    "    \n",
    "        # Dimensionality reduction\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "        x = Conv2D(n_filters, (1, 1), strides=(1, 1), use_bias=False, \n",
    "                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)\n",
    "\n",
    "        # Bottleneck layer\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "        x = Conv2D(n_filters, (3, 3), strides=(1, 1), padding=\"same\", use_bias=False, \n",
    "                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)\n",
    "\n",
    "        # Dimensionality restoration - increase the number of output filters by 4X\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "        x = Conv2D(n_filters * 4, (1, 1), strides=(1, 1), use_bias=False, \n",
    "                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)\n",
    "\n",
    "        # Add the identity link (input) to the output of the residual block\n",
    "        x = Add()([shortcut, x])\n",
    "        return x\n",
    "\n",
    "    @staticmethod\n",
    "    def projection_block(x, strides=(2,2), init_weights=None, **metaparameters):\n",
    "        \"\"\" Construct a Bottleneck Residual Block of Convolutions with Projection Shortcut\n",
    "            Increase the number of filters by 4X\n",
    "            x        : input into the block\n",
    "            strides  : whether the first convolution is strided\n",
    "            n_filters: number of filters\n",
    "            reg      : kernel regularizer\n",
    "        \"\"\"\n",
    "        n_filters = metaparameters['n_filters']\n",
    "        if 'reg' in metaparameters:\n",
    "            reg = metaparameters['reg']\n",
    "        else:\n",
    "            reg = ResNetV2.reg\n",
    "\n",
    "        if init_weights is None:\n",
    "            init_weights = ResNetV2.init_weights\n",
    "\n",
    "        # Construct the projection shortcut\n",
    "        # Increase filters by 4X to match shape when added to output of block\n",
    "        shortcut = BatchNormalization()(x)\n",
    "        shortcut = Conv2D(4 * n_filters, (1, 1), strides=strides, use_bias=False, \n",
    "                          kernel_initializer=init_weights, kernel_regularizer=reg)(shortcut)\n",
    "\n",
    "        ## Construct the 1x1, 3x3, 1x1 convolution block\n",
    "    \n",
    "        # Dimensionality reduction\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "        x = Conv2D(n_filters, (1, 1), strides=(1,1), use_bias=False, \n",
    "                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)\n",
    "\n",
    "        # Bottleneck layer\n",
    "        # Feature pooling when strides=(2, 2)\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "        x = Conv2D(n_filters, (3, 3), strides=strides, padding='same', use_bias=False, \n",
    "                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)\n",
    "\n",
    "        # Dimensionality restoration - increase the number of filters by 4X\n",
    "        x = BatchNormalization()(x)\n",
    "        x = ReLU()(x)\n",
    "        x = Conv2D(4 * n_filters, (1, 1), strides=(1, 1), use_bias=False, \n",
    "                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)\n",
    "\n",
    "        # Add the projection shortcut to the output of the residual block\n",
    "        x = Add()([x, shortcut])\n",
    "        return x\n",
    "\n",
    "    def classifier(self, x, n_classes):\n",
    "        \"\"\" Construct the Classifier Group \n",
    "            x         : input to the classifier\n",
    "            n_classes : number of output classes\n",
    "        \"\"\"\n",
    "        # Pool at the end of all the convolutional residual blocks\n",
    "        x = GlobalAveragePooling2D()(x)\n",
    "\n",
    "        # Final Dense Outputting Layer for the outputs\n",
    "        outputs = Dense(n_classes, activation='softmax', \n",
    "                        kernel_initializer=self.init_weights, kernel_regularizer=self.reg)(x)\n",
    "        return outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def makeBaseModel(reg=None, n_blocks=4, lr=0.001, noise=None):\n",
    "    ResNetV2.reg = reg\n",
    "    \n",
    "    # Stem\n",
    "    inputs = Input((32, 32, 3))\n",
    "    x = Conv2D(32, (3, 3), strides=(1, 1), padding='same', \n",
    "               kernel_initializer='he_normal', kernel_regularizer=reg)(inputs)\n",
    "    x = BatchNormalization()(x)\n",
    "    x = ReLU()(x)\n",
    "\n",
    "    # Learner\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=16)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=64)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=128)\n",
    "\n",
    "    # Classifier\n",
    "    x = GlobalAveragePooling2D()(x)\n",
    "    \n",
    "    if noise:\n",
    "        x = GaussianNoise(noise)(x)\n",
    "        x = ReLU()(x)\n",
    "        \n",
    "    outputs = Dense(10, activation='softmax',\n",
    "                    kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    \n",
    "    resnet = Model(inputs, outputs)\n",
    "    resnet.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=lr), metrics=['acc'])\n",
    "    return resnet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train a Base Model\n",
    "\n",
    "Let's first a single (non-ensemble) model as our base reference. We will use a learning rate scheduler to train the model at a learning rate of 0.001 for 20 epochs, and then drop the learning rate by a magnitude to 0.0001 for the remaining 10 epochs. After 30 epochs, the validation/test accuracy will be about 84%.\n",
    "\n",
    "Note that the size of the model is just over 1.8 million parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resnet1 = makeBaseModel(reg=l2(0.001), noise=0.1)\n",
    "resnet1.summary()\n",
    "\n",
    "def lr_schedule(epoch, lr):\n",
    "    if epoch < 20:\n",
    "        return 0.001\n",
    "    else:\n",
    "        return 0.0001\n",
    "\n",
    "resnet1.fit(x_train, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=1, \n",
    "            callbacks=[LearningRateScheduler(lr_schedule)])\n",
    "resnet1.evaluate(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train Multiple Instances of the Model\n",
    "\n",
    "Next, we will train two additional instances of the same model (three altogether), where each model has a different draw from the random distribution for weight initialization.\n",
    "\n",
    "When we look at the results from the evaluation data for all three models, most often they will be very close to each other. In most runs, you might see the range of difference as little as < 0.5% or as large as 2%. For example, you might see something like [84%, 82.5%, 83%]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resnet2 = makeBaseModel()\n",
    "resnet2.fit(x_train, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=1, \n",
    "            callbacks=[LearningRateScheduler(lr_schedule)])\n",
    "resnet2.evaluate(x_test, y_test)\n",
    "\n",
    "resnet3 = makeBaseModel()\n",
    "resnet3.fit(x_train, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=1, \n",
    "            callbacks=[LearningRateScheduler(lr_schedule)])\n",
    "resnet3.evaluate(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ensemble\n",
    "\n",
    "Let's make a traditional inter-model ensemble. In this case, we will create a new wrapper model ('ensemble'), and include each of the model instances as a branch from the input. Finally, we add the outputs from each model together and do a softmax (effectively an argmax) for our majority vote of the models.\n",
    "\n",
    "Let's compare the results of the inter-model ensemble to the individual model results. One should see a modest boost of ~2% above the best performance of the individual models. For example, if the best performing individual model was 83%, then the inter-model ensemble would be ~85%.\n",
    "\n",
    "Note that the size of this inter-model ensemble is just over 5.6 million parameters (3X as a single model instance)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input to the Ensemble\n",
    "inputs = Input((32, 32, 3))\n",
    "\n",
    "# Each model will be a branch in the ensemble\n",
    "o1 = resnet1(inputs)\n",
    "o2 = resnet2(inputs)\n",
    "o3 = resnet3(inputs)\n",
    "\n",
    "# Implement majority voting by adding their softmax predictions\n",
    "outputs = Add()([o1, o2, o3])\n",
    "outputs = Activation('softmax')(outputs)\n",
    "\n",
    "ensemble = Model(inputs, outputs)\n",
    "ensemble.summary()\n",
    "ensemble.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['acc'])\n",
    "ensemble.evaluate(x_test, y_test)"
   ]
  },
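  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why the final softmax does not change the vote can be sketched in two lines. This is an informal argument rather than a proof, and the notation is ours, not part of the lab: $p^{(k)}$ is the softmax output of model $k$, and $s = p^{(1)} + p^{(2)} + p^{(3)}$ is the summed score vector.\n",
    "\n",
    "$$\\mathrm{softmax}(s)_i = \\frac{e^{s_i}}{\\sum_j e^{s_j}}, \\qquad s_i > s_j \\iff \\mathrm{softmax}(s)_i > \\mathrm{softmax}(s)_j$$\n",
    "\n",
    "$$\\arg\\max_i \\, \\mathrm{softmax}(s)_i = \\arg\\max_i \\, s_i$$\n",
    "\n",
    "Because the exponential is strictly increasing and the denominator is shared by all classes, the predicted class is the one with the largest summed probability. Strictly speaking, summing probabilities is soft voting (equivalent to averaging) rather than a count of hard votes, so a single very confident model can outweigh two mildly confident ones."
   ]
  },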
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Weight Variance across model instances\n",
    "\n",
    "Let's now look at the variance of weights at the same layer across the model instances (each with a different weight initialization draw). One would observe that at the top layer before the softmax activation (referred to as the 'feature vector' or 'embedding') that there is very little variance. So little, as the next section will show, can be interchanged between the models with little to no effect on the output performance.\n",
    "\n",
    "On the otherhand, any layer past the bottleneck layer (not demonstrated) will show a rapid degradation in performance when interchanged between the models. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights1 = resnet1.get_weights()\n",
    "weights2 = resnet2.get_weights()\n",
    "weights3 = resnet3.get_weights()\n",
    "print(\"Number of Weight Matrices\", len(weights1))"
   ]
  },
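  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The count printed above can be cross-checked by hand. The tally below is a sketch that assumes the default `n_blocks=4`, and that each `BatchNormalization` layer contributes four arrays (gamma, beta, moving mean, moving variance), each bias-free `Conv2D` one array, and the stem `Conv2D` and the final `Dense` two arrays each (kernel and bias).\n",
    "\n",
    "$$\\text{stem} = 2 + 4 = 6$$\n",
    "\n",
    "$$\\text{projection block} = (4 + 1) + 3 \\times (4 + 1) = 20, \\qquad \\text{identity block} = 3 \\times (4 + 1) = 15$$\n",
    "\n",
    "$$\\text{group} = 20 + 4 \\times 15 = 80$$\n",
    "\n",
    "$$\\text{total} = 6 + 3 \\times 80 + 2 = 248$$\n",
    "\n",
    "`GaussianNoise`, `ReLU`, `Add` and `GlobalAveragePooling2D` hold no weights, so they do not affect the tally. Of the 248 arrays, the final `Dense` classifier accounts for two."
   ]
  },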
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now do an interchange between the three trained model instances of just the bottom layer. Notice how there is essentially no performance change! While the feature vectors (embeddings) are not identical, they are interchangeable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"models 1, 2\")\n",
    "resnetx12 = makeBaseModel()\n",
    "resnetx12.set_weights( weights1[0:247] + weights2[247:])\n",
    "resnetx12.evaluate(x_test, y_test)\n",
    "resnetx21 = makeBaseModel()\n",
    "resnetx21.set_weights( weights2[0:247] + weights1[247:])\n",
    "resnetx21.evaluate(x_test, y_test)\n",
    "print(\"models 1, 3\")\n",
    "resnetx13 = makeBaseModel()\n",
    "resnetx13.set_weights( weights1[0:247] + weights3[247:])\n",
    "resnetx13.evaluate(x_test, y_test)\n",
    "resnetx31 = makeBaseModel()\n",
    "resnetx31.set_weights( weights3[0:247] + weights1[247:])\n",
    "resnetx31.evaluate(x_test, y_test)\n",
    "print(\"models 2, 3\")\n",
    "resnetx23 = makeBaseModel()\n",
    "resnetx23.set_weights( weights2[0:247] + weights3[247:])\n",
    "resnetx23.evaluate(x_test, y_test)\n",
    "resnetx32 = makeBaseModel()\n",
    "resnetx32.set_weights( weights3[0:247] + weights2[247:])\n",
    "resnetx32.evaluate(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Intra-Model Bagging\n",
    "\n",
    "What's the difference between inter-model and intra-model ensemble techniques? Inter-model means that each model instance is an independent model, with no shared layers (weights) and no shared training. This method is of higher computational complexity and meets the *traditional* definition of an ensemble (majority vote from a collection of independently trained weak learners).\n",
    "\n",
    "Intra-model ensemble methods go against the traditional method and rely on the concept of the lottery hypothesis; whereby each trained model instance has shared layers and trained together, with a separate classifier. The assumption is that each classifier has an independent draw from the random distribution for weight initializations, and this will give equalivalent results as a traditional ensemble, but with substantial less complexity.\n",
    "\n",
    "We should observe that each of the three classifiers will be very close to each other in performance, typically within 0.025 (1/4 of 1 percent)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def makeBagging(reg=None, n_blocks=4, lr=0.001, noise=None):\n",
    "    ResNetV2.reg = reg\n",
    "    \n",
    "    # Stem\n",
    "    inputs = Input((32, 32, 3))\n",
    "    x = Conv2D(32, (3, 3), strides=(1, 1), padding='same', \n",
    "               kernel_initializer='he_normal', kernel_regularizer=reg)(inputs)\n",
    "    x = BatchNormalization()(x)\n",
    "    x = ReLU()(x)\n",
    "\n",
    "    # Learner\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=16)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=64)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=128)\n",
    "\n",
    "    # Classifier\n",
    "    x = GlobalAveragePooling2D()(x)\n",
    "    \n",
    "    if noise:\n",
    "        x = GaussianNoise(noise)(x)\n",
    "        x = ReLU()(x)\n",
    "    \n",
    "    # Multiple Instances of Classifier (Bagging)\n",
    "    outputs1 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    outputs2 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    outputs3 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    \n",
    "    resnet = Model(inputs, [outputs1, outputs2, outputs3])\n",
    "    resnet.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])\n",
    "    return resnet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resnet_b = makeBagging(reg=l2(0.001), noise=0.1)\n",
    "resnet_b.summary()\n",
    "\n",
    "resnet_b.fit(x_train, [y_train, y_train, y_train], epochs=30, batch_size=32, verbose=1, validation_split=0.1, \n",
    "             callbacks=[LearningRateScheduler(lr_schedule)])\n",
    "resnet_b.evaluate(x_test, [y_test, y_test, y_test])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now save the weights from the trained intra-model bagging."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights_b = resnet_b.get_weights()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Weight Parameterization\n",
    "\n",
    "In the above method, the model reported all three *votes* separately. We now construct the same model to add the step of majority voting. To do this, we will add two new layers to the top of the model. The first layer we add will add the outputs from each of the three classifier layers into a single vector. That is, all three predictions will be summed together for each class -- which is a form of weight parameterization. We will then pass the vector through a softmax activation (which essentially is an argmax in this case) for the final prediction. This does not add any new parameters, and simply implements majority voting.\n",
    "\n",
    "Let's compare the results between the individual classifiers within the model and the bagged classifier. We see that there likely is very little difference, generally in the range of 1/10 to 1/4 of 1% increase.\n",
    "\n",
    "Note that the size of this inter-model ensemble is just over 5.6 million parameters (3X as a single model instance)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def makeBaggingEx(reg=None, n_blocks=4, lr=0.001, noise=None):\n",
    "    ResNetV2.reg = reg\n",
    "    \n",
    "    # Stem\n",
    "    inputs = Input((32, 32, 3))\n",
    "    x = Conv2D(32, (3, 3), strides=(1, 1), padding='same', \n",
    "               kernel_initializer='he_normal', kernel_regularizer=reg)(inputs)\n",
    "    x = BatchNormalization()(x)\n",
    "    x = ReLU()(x)\n",
    "\n",
    "    # Learner\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=16)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=64)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=128)\n",
    "\n",
    "    # Classifier\n",
    "    x = GlobalAveragePooling2D()(x)\n",
    "    \n",
    "    if noise:\n",
    "        x = GaussianNoise(noise)(x)\n",
    "        x = ReLU()(x)\n",
    "\n",
    "    outputs1 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    outputs2 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    outputs3 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    \n",
    "    # Parameterize the weights from all three classifiers back into one classifier\n",
    "    outputs  = Add()([outputs1, outputs2, outputs3])\n",
    "    outputs  = Activation('softmax')(outputs)\n",
    "    \n",
    "    resnet = Model(inputs, outputs)\n",
    "    resnet.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])\n",
    "    return resnet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resnet_bx = makeBaggingEx(reg=l2(0.001), noise=0.1)\n",
    "resnet_bx.summary()\n",
    "\n",
    "resnet_bx.set_weights(weights_b)\n",
    "resnet_bx.evaluate(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Intra-Model Stacking\n",
    "\n",
    "Another method to intra-model ensemble is stacking. Stacking is similar to the bagging method, except instead of bagging the results (adding together the individual classifiers and do majority voting), we pass the outputs from the pretrained classifiers to a second level classifier (\"the stack\"); whereby the second classifier learns to correct the misclassifications by the preceding models.\n",
    "\n",
    "We will build the model by starting with the prior intra-model bagging model, and then replace the majority voting classifier with a new classifier. To do this, we will concatenate all three output vectors (vs. add) from the three classifiers and pass the concatenated vector to a new classifier layer. We also add some additional Guassian noise (regularization) between the first level and second level classifiers for regularizing the second level classifier -- to address overfitting.\n",
    "\n",
    "Note that the total number of parameters has only gone up slightly from our intra-model bagging model at 1.9 million parameters (vs 1.8 million)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def makeStacking(reg=None, n_blocks=4, lr=0.001, noise=None):\n",
    "    ResNetV2.reg = reg\n",
    "    \n",
    "    # Stem\n",
    "    inputs = Input((32, 32, 3))\n",
    "    x = Conv2D(32, (3, 3), strides=(1, 1), padding='same', \n",
    "               kernel_initializer='he_normal', kernel_regularizer=reg)(inputs)\n",
    "    x = BatchNormalization()(x)\n",
    "    x = ReLU()(x)\n",
    "\n",
    "    # Learner\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=16)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=64)\n",
    "    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=128)\n",
    "\n",
    "    # Classifier\n",
    "    x = GlobalAveragePooling2D()(x)\n",
    "    \n",
    "    if noise:\n",
    "        x = GaussianNoise(noise)(x)\n",
    "        x = ReLU()(x)\n",
    "\n",
    "    outputs1 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    outputs2 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    outputs3 = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(x)\n",
    "    \n",
    "    # Stacking\n",
    "    outputs  = Concatenate()([outputs1, outputs2, outputs3])\n",
    "    outputs  = Dense(10, activation='softmax',\n",
    "                     kernel_initializer='he_normal', kernel_regularizer=reg)(outputs)\n",
    "    \n",
    "    resnet = Model(inputs, outputs)\n",
    "    resnet.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])\n",
    "    return resnet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resnet_s = makeStacking(reg=l2(0.001), noise=0.1)\n",
    "resnet_s.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will copy over the pretrained weights for the first level classifier (weights_b[:247]). Next we set all the layers of the first level classifier to non-trainable; i.e., we will only train the second level classifier."
   ]
  },
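  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two steps described above (splicing pretrained weights in front of freshly initialized ones, then freezing everything except the last layer) can be sketched without Keras. This is only an illustration: `splice_weights` and `freeze_all_but_last` are hypothetical helpers, strings stand in for weight arrays, and `SimpleNamespace` objects stand in for layers.\n",
    "\n",
    "```python\n",
    "from types import SimpleNamespace\n",
    "\n",
    "def splice_weights(pretrained, fresh, n_keep):\n",
    "    # Take the first n_keep entries from pretrained and the rest from fresh.\n",
    "    if n_keep > len(pretrained) or n_keep > len(fresh):\n",
    "        raise ValueError('n_keep exceeds the number of available weight arrays')\n",
    "    return list(pretrained[:n_keep]) + list(fresh[n_keep:])\n",
    "\n",
    "def freeze_all_but_last(layers):\n",
    "    # Mark every layer except the final one as non-trainable.\n",
    "    for layer in layers[:-1]:\n",
    "        layer.trainable = False\n",
    "\n",
    "# Stand-ins for weight lists (hypothetical; strings play the role of arrays).\n",
    "pretrained = ['stem_b', 'head1_b', 'head2_b']\n",
    "fresh = ['stem_s', 'head1_s', 'head2_s', 'meta_kernel_s', 'meta_bias_s']\n",
    "merged = splice_weights(pretrained, fresh, n_keep=3)\n",
    "\n",
    "# Stand-ins for layers (hypothetical names).\n",
    "layers = [SimpleNamespace(name=n, trainable=True) for n in ('stem', 'heads', 'meta')]\n",
    "freeze_all_but_last(layers)\n",
    "\n",
    "print(merged)\n",
    "print([(layer.name, layer.trainable) for layer in layers])\n",
    "```\n",
    "\n",
    "The length check guards against a split index that does not match the weight lists, which is easy to get wrong when the index is hard-coded. In Keras, a change to `trainable` takes effect once the model is compiled again, which this notebook does before calling `fit`."
   ]
  },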
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights_s = resnet_s.get_weights()\n",
    "resnet_s.set_weights(weights_b[:247] + weights_s[247:])\n",
    "\n",
    "for _ in range(len(resnet_s.layers)-1):\n",
    "    resnet_s.layers[_].trainable = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will train the second level classifier with low learning rate (0.0001). Observe that after a few epochs, the validation accuracy plateaus out around the same as for the intra-model bagging version -- i.e., it does not appear in this scenario that we are learning to correct the mistakes of the first level classifier; we are still overfitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resnet_s.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.0001), metrics=['acc'])\n",
    "resnet_s.fit(x_train, y_train, epochs=10, batch_size=32, verbose=1, validation_split=0.1)\n",
    "resnet_s.evaluate(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next\n",
    "\n",
    "Think how you can modify this experiment, to meet the objectives."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
